Motivation:

Analyzing almost any social media data set can reveal how subgroups of a population interact with each other on a large scale. We are interested in the content of these interactions and how it varies across the United States over the few days that our data spans.

Related work: Anything that inspired you, such as a paper, a web site, or something we discussed in class.

Initial questions:

What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?

Data:

Source, scraping method, cleaning, etc.

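# Packages used throughout this report
library(tidyverse)  # readr, dplyr, tidyr, ggplot2, stringr, forcats

# Computing the sentiment scores is slow, so the results were saved to a
# new data file; read that file instead of re-running the sentiment function
us_tweets <- read_csv("us_tweets.csv")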

# Remove every character that is not a letter or a space
us_tweets$tweet_content_stripped <- gsub("[^[:alpha:] ]", "", us_tweets$tweet_content)

# Remove all words that are 1-2 letters long
us_tweets$tweet_content_stripped <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", us_tweets$tweet_content_stripped)
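
To see what the two cleaning steps do, the sketch below applies the same two patterns to a single made-up tweet (the sample text and the variable names are illustrative assumptions, not part of our data). Note that the first step also removes the # symbol and digits, so hashtags survive only as plain words.

```r
# A made-up tweet, used only to illustrate the cleaning steps
sample_tweet <- "Hiring now! Join us at the #NYC job fair @ 5pm"

# Step 1: keep only letters and spaces
stripped <- gsub("[^[:alpha:] ]", "", sample_tweet)

# Step 2: replace words of one or two letters with a single space
stripped <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", stripped)

# Split on runs of spaces to inspect the surviving words
remaining_words <- strsplit(trimws(stripped), " +")[[1]]
print(remaining_words)
```

Because each removed word is replaced by a space, the cleaned text can contain runs of spaces; the tokenizer used later splits on whitespace, so this should not affect the word counts.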

Exploratory analysis:

Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.

Our additional Shiny repos can be found here for the entire US and here for individual states.

Discussion:

What were your findings? Are they what you expect? What insights into the data can you make?

# Sum the eight emotion columns (columns 20-27) across all tweets
sentimentTotals <- data.frame(colSums(us_tweets[, 20:27]))

names(sentimentTotals) <- "count"

sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals),
                         sentimentTotals)

sentimentTotals
##                 sentiment count
## anger               anger 13605
## anticipation anticipation 52960
## disgust           disgust 12668
## fear                 fear 19942
## joy                   joy 46690
## sadness           sadness 21882
## surprise         surprise 22067
## trust               trust 76347
# Reshape to long format: one row per tweet and sentiment
us_tweets_long <- gather(us_tweets, sentiment, count, anger:trust, 
                         factor_key = TRUE)
# Parse the hour strings as date-times (the format expects a leading space)
us_tweets$hour <- as.POSIXct(us_tweets$hour, format = " %H:%M")

library(scales)  # date_format()

ggplot(data = us_tweets, aes(x = hour)) +
  geom_bar() +
  xlab("Time") + ylab("Number of tweets") +
  ggtitle("Number of Tweets per Hour") +
  scale_x_datetime(labels = date_format("%H:%M"))

From this graph, we noticed that some time intervals are missing from our data set. We are not sure why; the website from which we obtained the data may not have scraped during these times.
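
One way to pin down which intervals are missing is to compare the hours that appear in the data against the full range of hours. The sketch below does this on a small made-up vector (hour_strings and the 0 to 4 range are illustrative assumptions; the real column spans more hours).

```r
# Made-up hour strings in the same " %H:%M" layout assumed for the hour column
hour_strings <- c(" 00:00", " 01:00", " 01:00", " 04:00")

# Parse them; a fixed time zone keeps the result reproducible
parsed_hours <- as.POSIXct(hour_strings, format = " %H:%M", tz = "UTC")

# Hours of the day that actually occur in the sample
observed_hours <- unique(as.integer(format(parsed_hours, "%H")))

# Hours in the expected range that never occur
missing_hours <- setdiff(0:4, observed_hours)
print(missing_hours)
```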

# nchar() is vectorized, so no sapply() wrapper is needed
us_tweets$charsintweet <- nchar(us_tweets$tweet_content)

ggplot(data = us_tweets, aes(x = charsintweet)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 8) +
  theme(legend.position = "none") +
  xlab("Characters per Tweet") + 
  ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4") + 
  xlim(0,150) + 
  ggtitle("Characters per Tweet")

ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + 
  ylab("Total Count") + 
  ggtitle("Total Sentiment Score for All Tweets in Sample")

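library(tidytext)  # unnest_tokens(), stop_words

# Split the cleaned tweets into one word per row
tweet_words <- us_tweets %>% 
  unnest_tokens(word, tweet_content_stripped)

data(stop_words)

# Drop common stop words such as "the" and "and"
tweet_words <-  
  anti_join(tweet_words, stop_words, by = "word")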

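library(wordcloud)
library(RColorBrewer)

# Define the palette before it is used; brewer.pal() needs at least 3 colors
pal2 <- brewer.pal(8, "Dark2")

tweet_words %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 200, 
                 random.order = FALSE, 
                 rot.per = 0.35,  
                 colors = pal2))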

# Ten most frequent words
tweet_words %>% 
  count(word, sort = TRUE) %>% 
  top_n(10, n) %>% 
  mutate(word = fct_reorder(word, n)) %>% 
  ggplot(aes(x = word, y = n)) + 
  geom_col(fill = "blue", alpha = .6) + 
  coord_flip()

# Extract every hashtag, then strip the "#" and punctuation and lower-case it
hashtags <- str_extract_all(us_tweets$tweet_content, "#\\S+")
hashtags <- unlist(hashtags)
hashtags <- gsub("[^[:alnum:] ]", "", hashtags)
hashtags <- tolower(hashtags)
# Tabulate the hashtags and list the 20 most frequent
hashtag.df <- data.frame(table(hashtags))
hashtag.df$hashtags <- as.character(hashtag.df$hashtags)
hashtag.df$Freq <- as.numeric(as.character(hashtag.df$Freq))
hashtag.df <- arrange(hashtag.df, desc(Freq))
print(hashtag.df[1:20,])
##           hashtags  Freq
## 1              job 51511
## 2           hiring 45428
## 3             jobs 21910
## 4        careerarc 20717
## 5           retail  7454
## 6      hospitality  7311
## 7          nursing  5091
## 8       healthcare  4702
## 9         veterans  4471
## 10           sales  3310
## 11              it  2179
## 12 customerservice  1927
## 13  transportation  1568
## 14           sonic  1520
## 15   manufacturing  1476
## 16           photo  1432
## 17    businessmgmt  1348
## 18      accounting  1053
## 19     engineering   970
## 20         traffic   955

When mapping the positive scores for all tweets, we see low to moderate scores throughout the US. At this scale, we cannot see a definitive trend at the state level. However, we do see that few tweets were generated in the Midwest or Northwest. There do seem to be slightly more positive tweets from the middle of the country.

# Positive score of each tweet, plotted by location
us_tweets %>%
  filter(country == "US") %>% 
  ggplot(aes(x = longitude, y = latitude, color = positive)) + 
  geom_point(alpha = .6) +
  scale_colour_gradientn(colours = rainbow(10)) +
  ggtitle("Positive Tweets")

When mapping sentiment across the entire US, we see an overwhelming number of “trust” tweets. We are not quite sure what this emotion means. We found that most tweets including “job” or “jobs” mapped to the emotion “trust.” Because many tweets contain those words, it may be interesting to filter out that emotion.

#name of sentiment, ggplot
us_tweets_long %>%
  filter(country == "US") %>% 
  filter(count > 0) %>% 
  ggplot(aes(x = longitude, y = latitude, color = factor(sentiment))) + 
  geom_point(alpha = .6)+
  ggtitle("Tweet Sentiments") +
  scale_color_discrete(name="Sentiment")
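
The observation that tweets mentioning “job” or “jobs” tend to map to trust can be checked by comparing the share of trust-scoring tweets with and without those words. The sketch below uses a made-up four-row data frame (toy_tweets and its values are illustrative assumptions); on the real data, us_tweets would take its place.

```r
# Made-up tweets with a trust score, standing in for the real data
toy_tweets <- data.frame(
  tweet_content = c("We're #hiring! Click to apply for this job",
                    "Great day at the beach",
                    "New jobs posted in retail",
                    "Traffic is terrible today"),
  trust = c(1, 0, 2, 0),
  stringsAsFactors = FALSE
)

# TRUE for tweets containing the whole word "job" or "jobs"
mentions_job <- grepl("\\bjobs?\\b", toy_tweets$tweet_content,
                      ignore.case = TRUE)

# Share of tweets with a nonzero trust score, with and without a job mention
trust_share_job <- mean(toy_tweets$trust[mentions_job] > 0)
trust_share_other <- mean(toy_tweets$trust[!mentions_job] > 0)

print(c(job = trust_share_job, other = trust_share_other))
```

A large gap between the two shares on the real data would support filtering out trust, or the job tweets themselves, before comparing the remaining emotions.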

When we filter out trust, we see that surprise and joy seem to be commonly tweeted emotions.

# Sentiment of each tweet by location, with trust removed
us_tweets_long %>%
  filter(country == "US") %>% 
  filter(count > 0) %>% 
  filter(sentiment != "trust") %>% 
  ggplot(aes(x = longitude, y = latitude, color = factor(sentiment))) + 
  geom_point(alpha = .6)+
  ggtitle("Tweet Sentiments") +
  scale_color_discrete(name="Sentiment")

Because the location column varies in specificity, we built a function that takes the latitude and longitude of each tweet and returns the state in which the tweet originated. We then added that state to our original dataset.

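library(maps)      # map()
library(maptools)  # map2SpatialPolygons()
library(sp)        # CRS(), SpatialPoints(), over()

# Coordinates only, longitude (x) before latitude (y)
state_tweets = us_tweets %>%
  select("longitude", "latitude")

latlong2state <- function(state_tweets) {
    # Build one polygon per state from the outlines in the maps package
    states <- map('state', fill=TRUE, col="transparent", plot=FALSE)
    IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
    states_sp <- map2SpatialPolygons(states, IDs=IDs,
                     proj4string=CRS("+proj=longlat +datum=WGS84"))

    # Convert the tweet coordinates to points in the same projection
    states_tweets_SP <- SpatialPoints(state_tweets, 
                    proj4string=CRS("+proj=longlat +datum=WGS84"))
    
    # For each point, find the index of the polygon that contains it
    indices <- over(states_tweets_SP, states_sp)

    # Translate polygon indices to state names (NA when outside every state)
    stateNames <- sapply(states_sp@polygons, function(x) x@ID)
    stateNames[indices]
}

state_name = latlong2state(state_tweets)

# Add the state as the first column of the data set
us_tweets = cbind(state_name, us_tweets)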

To evaluate overall sentiment by state, we selected the appropriate columns, then grouped and summed by state, making sure to exclude tweets with missing locations. Maine, Alaska, and Hawaii were not included in this survey; the count of 48 states comes from Virginia and the District of Columbia receiving individual designations.

us_sentiments = us_tweets %>%
  filter(country == "US") %>%
  select(c(1, 21:30)) %>%
  na.omit() %>%  # drops rows with any NA, including a missing state_name
  group_by(state_name) %>%
  summarise_all(sum) %>%
  mutate(positive = as.numeric(positive),
         negative = as.numeric(negative))

The following heatmaps show the level of positive and negative sentiment across the United States during the 48-hour period of our dataset. Maine, Alaska, and Hawaii are blacked out because tweets from those states were not recorded.

We can observe from these two maps that states like California and Texas consistently rank highest, which is presumably related to population. Interestingly, the state with the lowest positive and negative sentiment scores is Washington. This could be for two reasons: a difference in population, or tweets from Washington containing fewer sentiment words than those from other states and therefore generating weaker sentiment scores.

library(choroplethr)  # state_choropleth()

us_sentiments %>%
  select("state_name", "negative") %>%
  rename(region = state_name, value = negative) %>%
  state_choropleth(title = "Negative Sentiment Across the U.S.",
                   legend = "Sentiment Score")

us_sentiments %>%
  select("state_name", "positive") %>%
  rename(region = state_name, value = positive) %>%
  state_choropleth(title = "Positive Sentiment Across the U.S.",
                   legend = "Sentiment Score")
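
To separate the two explanations, one option is to divide each state's total score by its number of tweets: a per-tweet average is unaffected by volume, so a state that scores low only because it has few tweets would no longer stand out. The sketch below shows the idea on made-up numbers (toy_states and its values are illustrative assumptions, not results from our data).

```r
# Made-up per-tweet positive scores for two states
toy_states <- data.frame(
  state_name = c("california", "california", "california",
                 "washington", "washington"),
  positive = c(2, 1, 3, 2, 2),
  stringsAsFactors = FALSE
)

# Total positive score and number of tweets per state
totals <- aggregate(positive ~ state_name, data = toy_states, FUN = sum)
counts <- aggregate(positive ~ state_name, data = toy_states, FUN = length)

# Combine into one table and compute the average score per tweet
per_tweet <- data.frame(
  state_name = totals$state_name,
  total_positive = totals$positive,
  n_tweets = counts$positive,
  mean_positive = totals$positive / counts$positive
)
print(per_tweet)
```

In this toy example the totals differ (6 versus 4) while the per-tweet average is 2 in both states, which is the pattern we would expect if volume alone explained a low total.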